Set up Nvidia Modulus v22.03 on Sunbird using an interactive GPU session#
As of 17 Apr 2022, the link to the Modulus tutorial is a bit hard to find. Here it is: https://docs.nvidia.com/deeplearning/modulus/index.html
Installation#
It turns out that Conda environments run into a lot of issues, so I will use a Python virtual environment without JupyterLab.
Installing latest Python#
The system-wide Python versions available on Sunbird are very old. The latest one is 3.6.
[s.1915438@sl1 ~]$ ls /usr/bin/python*
/usr/bin/python /usr/bin/python2 /usr/bin/python2.7 /usr/bin/python2.7-config /usr/bin/python2-config /usr/bin/python3 /usr/bin/python3.6 /usr/bin/python3.6m /usr/bin/python-config
To create a virtual environment with a newer Python, we can use the Python from within a conda environment.
Create a new conda environment as follows. This gives us a conda environment with a recent Python.
module load anaconda/2021.05
conda create --name modulus python=3.9
source activate modulus
Let us check the Python version in the modulus environment.
(modulus) [s.1915438@sl1 ~]$ which python
/lustrehome/home/s.1915438/modulus/bin/python
(modulus) [s.1915438@sl1 ~]$ python --version
Python 3.9.12
We can use this Python to create our virtual environment as follows. I will create it on the /scratch/ partition, as it is faster than the /lustrehome/ partition.
(modulus) [s.1915438@sl1 ~]$ cd /scratch/s.1915438
(modulus) [s.1915438@sl2 s.1915438]$ mkdir env
(modulus) [s.1915438@sl2 s.1915438]$ ls
ansys195 env jupyter_env.sh jupyter_log jupyter.sh modulus Modulus_examples Modulus_source
(modulus) [s.1915438@sl2 s.1915438]$ cd env
(modulus) [s.1915438@sl2 env]$ python3 -m venv modulus
(modulus) [s.1915438@sl2 env]$
Now it is time to leave the conda environment. The simplest way is to re-establish the SSH connection.
Running Python virtual environment#
The Python virtual environment can be activated using this command:
[s.1915438@sl1 ~]$ cd /scratch/s.1915438
[s.1915438@sl1 s.1915438]$ source env/modulus/bin/activate
(modulus) [s.1915438@sl1 s.1915438]$
Now we can check the Python version:
(modulus) [s.1915438@sl1 s.1915438]$ which python
/scratch/s.1915438/env/modulus/bin/python
(modulus) [s.1915438@sl1 s.1915438]$ python --version
Python 3.9.12
(modulus) [s.1915438@sl1 s.1915438]$
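The same check can be made from inside Python. The sketch below is only illustrative: the helper name in_virtualenv is my own and not part of any tool used on this page. It relies on the fact that inside a venv sys.prefix points at the environment, while sys.base_prefix points at the interpreter the environment was created from.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    The arguments default to the running interpreter's values; they are
    parameters only so the rule can be tried without activating anything.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    # Report the interpreter version and where it lives.
    print("Python", ".".join(map(str, sys.version_info[:3])))
    print("prefix:", sys.prefix)
    print("inside a virtual environment:", in_virtualenv())
```

Run it with the python of the activated environment; if the last line says False, the activation did not take effect in this shell.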
Installing PyTorch#
Remember to install the correct build of PyTorch for the Nvidia A100. Version '1.11.0+cu102', i.e. 1.11 with CUDA 10.2, is incompatible, and you will see the following warning.
(modulus) [s.1915438@sl2 helmholtz]$ srun python helmholtz.py
/scratch/s.1915438/env/modulus/lib/python3.9/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
So, install a build for a newer CUDA version, such as '1.11.0+cu113', using pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113.
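To check an installation before starting a long run, something like the following sketch can be used. The helper names supported_majors and is_gpu_supported are hypothetical, and comparing only the major compute-capability number is a simplifying assumption on my part, not a statement of PyTorch's exact rule. The torch calls (torch.cuda.get_arch_list, torch.cuda.get_device_capability, torch.cuda.get_device_name) are only reached when torch is installed.

```python
import re


def supported_majors(arch_list):
    """Collect major compute-capability numbers from names such as 'sm_70'."""
    majors = set()
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
        if match:
            majors.add(int(match.group(1)) // 10)
    return majors


def is_gpu_supported(capability, arch_list):
    """Assumption: a GPU is usable if its major version appears in arch_list."""
    major, _minor = capability
    return major in supported_majors(arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(torch.cuda.get_device_name(0), cap, archs)
            print("supported:", is_gpu_supported(cap, archs))
        else:
            print("no CUDA device visible (is this shell inside a GPU job?)")
```

With the architecture list from the warning above (sm_37 sm_50 sm_60 sm_70) and the A100's capability (8, 0), is_gpu_supported returns False, which matches the message.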
Installing Nvidia Modulus v22.03#
A requirements.txt file is present in this directory. It contains the commands to install the prerequisites for Modulus. Please do not follow Nvidia’s online instructions.
pip3 install matplotlib transforms3d future typing numpy quadpy numpy-stl==2.11.2 h5py sympy==1.5.1 termcolor psutil symengine==0.6.1 numba Cython chaospy torch_optimizer vtk omegaconf hydra-core einops timm tensorboard pandas orthopy ndim
pip3 install -U https://github.com/paulo-herrera/PyEVTK/archive/v1.1.2.tar.gz
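After the two pip3 commands finish, it can be useful to confirm that the prerequisites actually landed in the virtual environment. This is a minimal sketch using the standard importlib.metadata module; the function name missing_distributions and the short list of names it checks are my own choices for illustration, taken from the command above.

```python
from importlib import metadata


def missing_distributions(names):
    """Return the pip distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # A subset of the prerequisites; extend the list as needed.
    wanted = ["numpy", "sympy", "h5py", "omegaconf", "hydra-core"]
    print("missing:", missing_distributions(wanted) or "none")
```

Note that these are pip distribution names, which are not always the same as the names used in import statements.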
Go to the Nvidia Modulus source directory and install Modulus into the modulus virtual environment.
[s.1915438@sl1 Modulus]$ ls
accompanying_licences build changelog_tensorflow.md dist Dockerfile external MANIFEST.in modulus modulus.egg-info NVIDIA-OptiX-SDK-7.0.0-linux64.sh README.md requirements.txt setup.cfg setup.py
[s.1915438@sl1 Modulus]$ pwd
/scratch/s.1915438/Modulus_source/Modulus
[s.1915438@sl1 Modulus]$ python setup.py install
After some time you should see a success message:
Using /scratch/s.1915438/modulus/lib/python3.9/site-packages
Finished processing dependencies for modulus==22.3
Installing PySDF#
Copy the PySDF files from the previous release, i.e. from ./Modulus/external/pysdf in v21.06, and paste them into ./Modulus/external. I am doing this because Python 3.9 no longer supports installing egg files with easy_install, which is the default method for installing PySDF in Modulus v22.03.
Now we can proceed with the instructions from the older manual, as follows.
(/scratch/s.1915438/modulus) [s.1915438@sl1 Modulus]$ pwd
/scratch/s.1915438/Modulus_source/Modulus
(/scratch/s.1915438/modulus) [s.1915438@sl1 Modulus]$ cd external/
(/scratch/s.1915438/modulus) [s.1915438@sl1 external]$ ls
eggs lib pysdf
(/scratch/s.1915438/modulus) [s.1915438@sl1 external]$ export LD_LIBRARY_PATH=$(pwd)/pysdf/:${LD_LIBRARY_PATH}
Now install PySDF:
(modulus) [s.1915438@sl2 pysdf]$ pwd
/scratch/s.1915438/Modulus_source/Modulus/external/pysdf
(modulus) [s.1915438@sl2 pysdf]$ python setup.py install
After some time you will see:
Installed /scratch/s.1915438/env/modulus/lib/python3.9/site-packages/pysdf-0.1-py3.9-linux-x86_64.egg
Processing dependencies for pysdf==0.1
Finished processing dependencies for pysdf==0.1
Running an interactive GPU session#
salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2
Set the number of GPUs as you wish; the number of CPUs does not matter here.
(modulus) [s.1915438@sl2 helmholtz]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
salloc: Granted job allocation 7161838
salloc: Waiting for resource configuration
salloc: Nodes scs2041 are ready for job
We can see our job in two ways: using squeue --user=s.1915438 or squeue --partition=accel_ai.
[s.1915438@sl2 ~]$ squeue --partition=accel_ai
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7161842 accel_ai bash s.191543 R 0:38 1 scs2041
7161825 accel_ai Eval_ens a.bip5 R 1:08:17 1 scs2041
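If the partition is busy, the listing gets long. The following sketch shows one way to pick out your own jobs from text like the output above. The function names parse_squeue and jobs_for_user are hypothetical, and the code assumes the column layout shown here, including a USER column cut to 8 characters (which is why s.1915438 appears as s.191543).

```python
def parse_squeue(text):
    """Turn squeue's whitespace-separated table into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Split into at most as many fields as there are header columns.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


def jobs_for_user(jobs, user):
    """Assumption: the USER column is truncated to 8 characters."""
    return [job for job in jobs if job["USER"] == user[:8]]


SAMPLE = """JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7161842 accel_ai bash s.191543 R 0:38 1 scs2041
7161825 accel_ai Eval_ens a.bip5 R 1:08:17 1 scs2041"""

if __name__ == "__main__":
    for job in jobs_for_user(parse_squeue(SAMPLE), "s.1915438"):
        print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

In practice squeue --user already does this filtering; the sketch is mainly useful when the output has been saved to a log file.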
Running Nvidia Modulus example#
We can use srun to run any Python script on the GPU as follows:
(modulus) [s.1915438@sl2 seismic_wave]$ srun python wave_2d.py
training:
max_steps: 40000
grad_agg_freq: 1
rec_results_freq: 1000
:
<Output continues>
Cancelling model training#
Nvidia Modulus keeps training until max_steps is reached (40000 in the output above) and stores its data in the checkpoint folder. We can cancel the training at any time, for example once the loss is satisfactory, by pressing Ctrl+C multiple times.
Can’t run the SDF library and STL file support#
This is something I still have to look into. For now, here is the error.
(modulus) [s.1915438@sl1 s.1915438]$ cd Modulus_examples/examples/aneurysm/
(modulus) [s.1915438@sl1 aneurysm]$ ls
aneurysm.py conf openfoam stl_files
(modulus) [s.1915438@sl1 aneurysm]$ srun python aneurysm.py
Error importing pysdf. Make sure 'libsdf.so' is in LD_LIBRARY_PATH and pysdf is installed
Traceback (most recent call last):
File "/scratch/s.1915438/Modulus_examples/examples/aneurysm/aneurysm.py", line 25, in <module>
from modulus.geometry.tessellation.tessellation import Tessellation
File "/scratch/s.1915438/env/modulus/lib/python3.9/site-packages/modulus-22.3-py3.9.egg/modulus/geometry/tessellation/tessellation.py", line 11, in <module>
import pysdf.sdf as pysdf
ImportError: libsdf.so: cannot open shared object file: No such file or directory
srun: error: scs2041: task 0: Exited with exit code 1
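The message asks for libsdf.so to be on LD_LIBRARY_PATH. One thing worth checking, with no guarantee that it is the cause here, is whether the variable is actually set in the shell that launches srun, since an export only applies to the session it was typed in. The sketch below looks for the file in each directory of the variable; the function name find_in_library_path is hypothetical.

```python
import os


def find_in_library_path(filename, search_path):
    """Return the first directory entry of search_path that contains filename."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    hit = find_in_library_path("libsdf.so", path)
    print("LD_LIBRARY_PATH =", path or "(empty)")
    print("libsdf.so:", hit or "not found on LD_LIBRARY_PATH")
```

Running it through srun python, the same way as the example, shows what the job itself sees rather than what the login shell sees.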